# Computations
import numpy as np
import pandas as pd
import scipy.stats as stats
# sklearn
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, KFold, StratifiedShuffleSplit
from sklearn.feature_selection import RFE
from sklearn import datasets
from sklearn import metrics
from sklearn.svm import SVC
# Visualisation libraries
## Text
from colorama import Fore, Back, Style
from IPython.display import Image, display, Markdown, Latex, clear_output
## progressbar
import progressbar
## plotly
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.offline as py
from plotly.subplots import make_subplots
import plotly.express as px
## seaborn
import seaborn as sns
## matplotlib
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse, Polygon
from matplotlib.font_manager import FontProperties
import matplotlib.colors as mcolors
plt.style.use('seaborn-whitegrid')
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['text.color'] = 'k'
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In this article, we compare several classification methods on the breast cancer dataset. Details of this dataset can be found in the Diagnostic Wisconsin Breast Cancer Database [1]. We apply each classification method in turn and then compare the methods in terms of performance.
data = datasets.load_breast_cancer()
Data = pd.DataFrame(data['data'], columns = [x.title() for x in data['feature_names']])
Labels_dict = dict(zip(list(np.sort(np.unique(data['target'].tolist()))),
list([x.title() for x in data['target_names']])))
Target = 'Diagnosis'
Data[Target] = data['target']
display(Data)
print(data['DESCR'])
| | Mean Radius | Mean Texture | Mean Perimeter | Mean Area | Mean Smoothness | Mean Compactness | Mean Concavity | Mean Concave Points | Mean Symmetry | Mean Fractal Dimension | ... | Worst Texture | Worst Perimeter | Worst Area | Worst Smoothness | Worst Compactness | Worst Concavity | Worst Concave Points | Worst Symmetry | Worst Fractal Dimension | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | 0 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | 0 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | 0 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | 0 |
| 568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | 0.1587 | 0.05884 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 | 1 |
569 rows × 31 columns
.. _breast_cancer_dataset:
Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
worst/largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 0 is Mean Radius, field
10 is Radius SE, field 20 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ====== ======
Min Max
===================================== ====== ======
radius (mean): 6.981 28.11
texture (mean): 9.71 39.28
perimeter (mean): 43.79 188.5
area (mean): 143.5 2501.0
smoothness (mean): 0.053 0.163
compactness (mean): 0.019 0.345
concavity (mean): 0.0 0.427
concave points (mean): 0.0 0.201
symmetry (mean): 0.106 0.304
fractal dimension (mean): 0.05 0.097
radius (standard error): 0.112 2.873
texture (standard error): 0.36 4.885
perimeter (standard error): 0.757 21.98
area (standard error): 6.802 542.2
smoothness (standard error): 0.002 0.031
compactness (standard error): 0.002 0.135
concavity (standard error): 0.0 0.396
concave points (standard error): 0.0 0.053
symmetry (standard error): 0.008 0.079
fractal dimension (standard error): 0.001 0.03
radius (worst): 7.93 36.04
texture (worst): 12.02 49.54
perimeter (worst): 50.41 251.2
area (worst): 185.2 4254.0
smoothness (worst): 0.071 0.223
compactness (worst): 0.027 1.058
concavity (worst): 0.0 1.252
concave points (worst): 0.0 0.291
symmetry (worst): 0.156 0.664
fractal dimension (worst): 0.055 0.208
===================================== ====== ======
:Missing Attribute Values: None
:Class Distribution: 212 - Malignant, 357 - Benign
:Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
:Donor: Nick Street
:Date: November, 1995
This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2
Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
.. topic:: References
- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577,
July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.
As can be seen, the dataset has 569 instances and 31 columns: 30 numeric predictive attributes plus the Diagnosis target. The goal of the exercise is to build a classification model that predicts Diagnosis based on the rest of the attributes. First, however, let's look at the distribution of the Diagnosis attribute.
Moreover, features measured on very different scales have very different variances, which can hurt the modeling process, particularly for distance-based models such as support vector machines. For this reason, we standardize the features by removing the mean and scaling to unit variance.
def Feature_Normalize(X, PD):
    # Wrap long tick labels so that each line holds at most n words
    def List_Break(mylist, n = PD['word_break']):
        Out = []
        for x in mylist:
            y = x.split()
            Out.append('\n'.join(' '.join(y[k:k + n]) for k in range(0, len(y), n)))
        return Out

    # Standardize: zero mean and unit variance for every feature
    scaler = preprocessing.StandardScaler()
    X_std = scaler.fit_transform(X)
    X_std = pd.DataFrame(data = X_std, columns = X.columns)
    # Heatmaps of the feature variances before and after standardization
    fig, ax = plt.subplots(2, 1, figsize = PD['figsize'])
    ax = ax.ravel()
    CP = [sns.color_palette("OrRd", 20), sns.color_palette("Greens", X.shape[1])]
    Names = ['Variance of the Features', 'Variance of the Features (Standardized)']
    Sets = [X, X_std]
    kws = dict(label='Feature\nVariance', aspect=10, shrink= .3)
    for i in range(len(ax)):
        Temp = Sets[i].var().sort_values(ascending = False).to_frame(name= 'Variance').round(2).T
        _ = sns.heatmap(Temp, ax=ax[i], annot=True, square=True, cmap = CP[i],
                        linewidths = 0.8, vmin=0, vmax=Temp.max(axis =1).iloc[0],
                        annot_kws={"size": PD['annot_text_size']}, cbar_kws=kws)
        if PD['word_break'] is not None:
            mylist = List_Break(Temp.T.index.tolist())
            _ = ax[i].xaxis.set_ticklabels(mylist)
        _ = ax[i].set_yticklabels('')
        _ = ax[i].set_title(Names[i], weight='bold', fontsize = 14)
        _ = ax[i].set_aspect(1)
        del Temp
    plt.subplots_adjust(hspace=PD['hspace'])
    return X_std
X = Data.drop(columns = [Target])
y = Data[Target]
PD = dict(figsize = (20, 8), hspace = 0.2, annot_text_size = 8, word_break = None)
X = Feature_Normalize(X, PD)
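To make the standardization step concrete, the following is a minimal NumPy sketch of the transform that StandardScaler applies by default: each column is shifted by its mean and divided by its population standard deviation. The toy matrix and the function name standardize are illustrative assumptions and are not part of the notebook.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: (x - mean) / std, using the population std (ddof=0)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns to avoid division by zero
    std[std == 0] = 1.0
    return (X - mean) / std

# Hypothetical toy data: two features on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 1500.0],
                  [3.0, 2600.0],
                  [4.0,  900.0]])
Z = standardize(X_toy)
print(Z.mean(axis=0))  # column means after standardization
print(Z.std(axis=0))   # column standard deviations after standardization
```

Note that pandas computes var() with the sample variance (ddof=1), so the variances of the standardized features come out as n/(n-1), which is slightly above 1 before rounding.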
def DatasetTargetDist(Inp, Target, Labels_dict, PD):
    # Table of class counts and percentages
    Table = Inp[Target].value_counts().to_frame('Count').reset_index(drop = False).rename(columns = {'index':Target})
    Table[Target] = Table[Target].replace(Labels_dict)
    Table['Percentage'] = np.round(100*(Table['Count']/Table['Count'].sum()),2)
    fig = make_subplots(rows=1, cols=2, horizontal_spacing = 0.02, column_widths=PD['column_widths'],
                        specs=[[{"type": "table"},{"type": "pie"}]])
    # Right: pie chart
    fig.add_trace(go.Pie(labels=Table[Target].values, values=Table['Count'].values,
                         pull=PD['pull'], textfont=dict(size= PD['textfont']),
                         marker=dict(colors = PD['PieColors'], line=dict(color='black', width=1))), row=1, col=2)
    fig.update_traces(hole=PD['hole'])
    fig.update_layout(height = PD['height'], legend=dict(orientation="v"), legend_title_text= PD['legend_title'])
    # Left: table
    T = Table.copy()
    T['Percentage'] = T['Percentage'].map(lambda x: '%.2f%%' % x)
    Temp = [T[c].values for c in T.columns]
    fig.add_trace(go.Table(header=dict(values = list(Table.columns), line_color='darkslategray',
                                       fill_color= PD['TableColors'][0], align=['center','center'],
                                       font=dict(color='white', size=12), height=25),
                           columnwidth = PD['tablecolumnwidth'],
                           cells=dict(values=Temp, line_color='darkslategray',
                                      fill=dict(color= [PD['TableColors'][1], PD['TableColors'][1]]),
                                      align=['center', 'center'], font_size=12, height=20)), 1, 1)
    fig.update_layout(title={'text': '<b>' + Target + '</b>', 'x':PD['title_x'],
                             'y':PD['title_y'], 'xanchor': 'center', 'yanchor': 'top'})
    fig.show()
Pull = [0 for x in range((len(Labels_dict)-1))]
Pull.append(.05)
PD = dict(PieColors = ['SeaGreen','FireBrick'],
TableColors = ['Navy','White'], hole = .4,
column_widths=[0.6, 0.4],textfont = 14, height = 350, tablecolumnwidth = [0.1, 0.1, 0.1],
pull = Pull, legend_title = Target, title_x = 0.5, title_y = 0.8)
del Pull
DatasetTargetDist(Data, Target, Labels_dict, PD)
We split the data with StratifiedShuffleSplit. Like StratifiedKFold, it returns stratified splits: the train and test sets each contain approximately the same percentage of samples of each target class as the complete set. Here, 30% of the samples are held out for testing.
Test_Size = 0.3
sss = StratifiedShuffleSplit(n_splits=1, test_size=Test_Size, random_state=42)
# With n_splits=1 the splitter yields a single pair of positional index arrays
train_index, test_index = next(sss.split(X, y))
# X
if isinstance(X, pd.DataFrame):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
else:
    X_train, X_test = X[train_index], X[test_index]
# y
if isinstance(y, pd.Series):
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
else:
    y_train, y_test = y[train_index], y[test_index]
del sss
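The effect of stratification can be checked directly. The sketch below uses hypothetical labels with the same class counts as the dataset (212 and 357) and a dummy feature matrix, then compares the share of class 1 in the full set, the train set, and the test set. The variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels with the same class counts as the dataset (212 vs. 357)
y_toy = np.array([0] * 212 + [1] * 357)
X_toy = np.zeros((len(y_toy), 1))  # the features play no role in the split itself

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(sss.split(X_toy, y_toy))

overall_ratio = y_toy.mean()
train_ratio = y_toy[train_idx].mean()
test_ratio = y_toy[test_idx].mean()
print(f"class-1 share: overall={overall_ratio:.3f}, train={train_ratio:.3f}, test={test_ratio:.3f}")
```

The three shares should agree up to the rounding forced by integer sample counts, which is what makes the held-out scores comparable across splits.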
def Train_Test_Dist(X_train, y_train, X_test, y_test, PD, Labels_dict = Labels_dict):
    def ToSeries(x):
        return x.copy() if isinstance(x, pd.Series) else pd.Series(x)

    fig = make_subplots(rows=1, cols=3, horizontal_spacing = 0.02, column_widths= PD['column_widths'],
                        specs=[[{"type": "table"},{'type':'domain'}, {'type':'domain'}]])
    # Right: one pie per set (train, then test)
    Labels = list(Labels_dict.values())
    C = 2
    for s in [ToSeries(y_train).replace(Labels_dict), ToSeries(y_test).replace(Labels_dict)]:
        # value_counts sorts by frequency, so reorder the counts to match the label order
        fig.add_trace(go.Pie(labels= Labels,
                             values= s.value_counts().reindex(Labels, fill_value = 0).values, pull=PD['pull'],
                             textfont=dict(size=PD['textfont']),
                             marker=dict(colors = PD['PieColors'],
                                         line=dict(color='black', width=1))), row=1, col=C)
        fig.update_traces(hole=.5)
        fig.update_layout(legend=dict(orientation="v"), legend_title_text= PD['legend_title'])
        C+=1
    # Left: table with the shape of each set
    Table = pd.DataFrame(data={'Set':['X_train','X_test','y_train','y_test'],
                               'Shape':[X_train.shape, X_test.shape, y_train.shape, y_test.shape]}).astype(str)
    Temp = [Table[c].values for c in Table.columns]
    TableColors = PD['TableColors']
    fig.add_trace(go.Table(header=dict(values = list(Table.columns), line_color='darkslategray',
                                       fill_color= TableColors[0], align=['center','center'],
                                       font=dict(color='white', size=12), height=25),
                           columnwidth = PD['tablecolumnwidth'],
                           cells=dict(values=Temp, line_color='darkslategray',
                                      fill=dict(color= [TableColors[1], TableColors[1]]),
                                      align=['center', 'center'], font_size=12, height=20)), 1, 1)
    fig.update_layout(title={'text': '<b>' + 'Dataset Distribution' + '</b>', 'x':PD['title_x'],
                             'y':PD['title_y'], 'xanchor': 'center', 'yanchor': 'top'})
    if PD['height'] is not None:
        fig.update_layout(height = PD['height'])
    fig.show()
PD.update(dict(PieColors = ['FireBrick', 'SeaGreen'], column_widths=[0.3, 0.3, 0.3],
tablecolumnwidth = [0.2, 0.4], height = 350, legend_title = Target))
Train_Test_Dist(X_train, y_train, X_test, y_test, PD)
Support-vector machines (SVMs) are supervised learning models that can be used for both classification and regression analysis. For more details, please see the Support Vector Machines material from Statistical Learning.
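SVC uses the RBF kernel by default, K(x, z) = exp(-gamma * ||x - z||^2), and the gamma values that appear later in the parameter search control how quickly this similarity decays with distance. The following NumPy sketch computes the kernel matrix for a few hypothetical samples. The function name rbf_kernel_matrix and the toy data are assumptions for illustration, while the two gamma formulas follow the scikit-learn documentation for gamma='scale' and gamma='auto'.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Pairwise RBF kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))  # hypothetical standardized samples

gamma_scale = 1.0 / (X_toy.shape[1] * X_toy.var())  # what gamma='scale' resolves to
gamma_auto = 1.0 / X_toy.shape[1]                   # what gamma='auto' resolves to
K = rbf_kernel_matrix(X_toy, gamma_scale)
print(K.round(3))
```

A small gamma makes distant points look similar and gives smoother decision boundaries, whereas a large gamma lets the model fit the training set very closely.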
def Header(Text, L = 100, C = 'Blue', T = 'White'):
    BACK = {'Black': Back.BLACK, 'Red':Back.RED, 'Green':Back.GREEN, 'Yellow': Back.YELLOW, 'Blue': Back.BLUE,
            'Magenta':Back.MAGENTA, 'Cyan': Back.CYAN}
    FORE = {'Black': Fore.BLACK, 'Red':Fore.RED, 'Green':Fore.GREEN, 'Yellow':Fore.YELLOW, 'Blue':Fore.BLUE,
            'Magenta':Fore.MAGENTA, 'Cyan':Fore.CYAN, 'White': Fore.WHITE}
    print(BACK[C] + FORE[T] + Style.NORMAL + Text + Style.RESET_ALL + ' ' + FORE[C] +
          Style.NORMAL + (L- len(Text) - 1)*'=' + Style.RESET_ALL)

def Line(L=100, C = 'Blue'):
    FORE = {'Black': Fore.BLACK, 'Red':Fore.RED, 'Green':Fore.GREEN, 'Yellow':Fore.YELLOW, 'Blue':Fore.BLUE,
            'Magenta':Fore.MAGENTA, 'Cyan':Fore.CYAN, 'White': Fore.WHITE}
    print(FORE[C] + Style.NORMAL + L*'=' + Style.RESET_ALL)

def Search_List(Key, List): return [s for s in List if Key in s]
def Best_Parm(model, param_dist, Top = None, X = X, y = y, n_splits = 20, scoring = 'precision', H = 600, titleY = .95):
    # Randomized search scored over stratified shuffle splits
    grid = RandomizedSearchCV(estimator = model, param_distributions = param_dist,
                              cv = StratifiedShuffleSplit(n_splits=n_splits, test_size=Test_Size, random_state=42),
                              n_iter = int(1e3), scoring = scoring, error_score = 0, verbose = 0,
                              n_jobs = 10, return_train_score = True)
    _ = grid.fit(X, y)
    Table = Grid_Table(grid)
    if Top is None:
        Top = Table.shape[0]
    Table = Table.iloc[:Top,:]
    # Table
    T = Table.copy()
    T['Train Score'] = T['Mean Train Score'].map(lambda x: ('%.2e' % x))+ ' ± ' +T['STD Train Score'].map(lambda x: ('%.2e' % x))
    T['Test Score'] = T['Mean Test Score'].map(lambda x: ('%.2e' % x))+ ' ± ' +T['STD Test Score'].map(lambda x: ('%.2e' % x))
    T['Fit Time'] = T['Mean Fit Time'].map(lambda x: ('%.2e' % x))+ ' ± ' +T['STD Fit Time'].map(lambda x: ('%.2e' % x))
    T = T.drop(columns = ['Mean Train Score','STD Train Score','Mean Test Score','STD Test Score','Mean Fit Time','STD Fit Time'])
    # Note: in pandas >= 1.4, Styler.hide(axis='index') replaces the deprecated hide_index()
    display(T.head(Top).style.hide_index().background_gradient(subset= ['Rank Test Score'],
                cmap=sns.diverging_palette(145, 300, s=60, as_cmap=True)).\
            set_properties(subset=['Params'], **{'background-color': 'Indigo', 'color': 'White'}).\
            set_properties(subset=['Train Score'], **{'background-color': 'HoneyDew', 'color': 'Black'}).\
            set_properties(subset=['Test Score'], **{'background-color': 'Azure', 'color': 'Black'}).\
            set_properties(subset=['Fit Time'], **{'background-color': 'Linen', 'color': 'Black'}))
    # Plot
    Grid_Performance_Plot(Table, n_splits = n_splits, H = H, titleY = titleY)
    return grid
def Grid_Table(grid):
    # Collect the search results into a table sorted by test rank
    Table = pd.DataFrame({'Rank Test Score': grid.cv_results_['rank_test_score'],
                          'Params':[str(s).replace('{', '').replace('}', '').\
                                    replace("'", '') for s in grid.cv_results_['params']],
                          # Train
                          'Mean Train Score': grid.cv_results_['mean_train_score'],
                          'STD Train Score': grid.cv_results_['std_train_score'],
                          # Test
                          'Mean Test Score': grid.cv_results_['mean_test_score'],
                          'STD Test Score': grid.cv_results_['std_test_score'],
                          # Fit time
                          'Mean Fit Time': grid.cv_results_['mean_fit_time'],
                          'STD Fit Time': grid.cv_results_['std_fit_time']})
    Table = Table.sort_values('Rank Test Score').reset_index(drop = True)
    return Table
def Grid_Performance_Plot(Table, n_splits, H = 550, titleY =.95):
    # Lower and upper limits of the shared y-axis (mean score minus/plus one standard deviation)
    Temp = Table['Mean Train Score']-Table['STD Train Score']
    Temp = np.append(Temp, Table['Mean Test Score']-Table['STD Test Score'])
    L = np.floor((Temp*100- Temp)).min()/100
    Temp = Table['Mean Train Score']+Table['STD Train Score']
    Temp = np.append(Temp, Table['Mean Test Score']+Table['STD Test Score'])
    R = np.ceil((Temp*100 + Temp)).max()/100
    fig = make_subplots(rows=1, cols=2, horizontal_spacing = 0.02, shared_yaxes=True,
                        subplot_titles=('<b>' + 'Train Set' + '</b>', '<b>' + 'Test Set' + '</b>'))
    fig.add_trace(go.Scatter(x= Table['Params'], y= Table['Mean Train Score'], showlegend=False, marker_color= 'SeaGreen',
                             error_y=dict(type='data',array=Table['STD Train Score'], visible=True)), 1, 1)
    fig.add_trace(go.Scatter(x= Table['Params'], y= Table['Mean Test Score'], showlegend=False, marker_color= 'RoyalBlue',
                             error_y=dict(type='data',array= Table['STD Test Score'], visible=True)), 1, 2)
    fig.update_xaxes(showline=True, linewidth=1, linecolor='Lightgray', mirror=True,
                     zeroline=False, zerolinewidth=1, zerolinecolor='Black',
                     showgrid=False, gridwidth=1, gridcolor='Lightgray')
    fig.update_yaxes(showline=True, linewidth=1, linecolor='Lightgray', mirror=True,
                     zeroline=True, zerolinewidth=1, zerolinecolor='Black',
                     showgrid=True, gridwidth=1, gridcolor='Lightgray', range= [L, R])
    fig.update_yaxes(title_text="Mean Score", row=1, col=1)
    fig.update_layout(plot_bgcolor= 'white', width = 980, height = H,
                      title={'text': '<b>' + 'RandomizedSearchCV with %i-fold cross validation' % n_splits + '</b>',
                             'x':0.5, 'y':titleY, 'xanchor': 'center', 'yanchor': 'top'})
    fig.show()
def Stratified_CV_Scoring(model, X = X, y = y, n_splits = 10, Labels = list(Labels_dict.values())):
    sss = StratifiedShuffleSplit(n_splits = n_splits, test_size=Test_Size, random_state=42)
    if isinstance(X, pd.DataFrame):
        X = X.values
    if isinstance(y, pd.Series):
        y = y.values
    Reports_Train = []
    Reports_Test = []
    CM_Train = []
    CM_Test = []
    # Refit the model on every split and collect the reports and confusion matrices
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        _ = model.fit(X_train,y_train)
        # Train
        y_pred = model.predict(X_train)
        R = pd.DataFrame(metrics.classification_report(y_train, y_pred, target_names=Labels, output_dict=True)).T
        Reports_Train.append(R.values)
        CM_Train.append(metrics.confusion_matrix(y_train, y_pred))
        # Test
        y_pred = model.predict(X_test)
        R = pd.DataFrame(metrics.classification_report(y_test, y_pred, target_names=Labels, output_dict=True)).T
        Reports_Test.append(R.values)
        CM_Test.append(metrics.confusion_matrix(y_test, y_pred))
    # Train: mean and standard deviation over the splits
    ALL = Reports_Train[0].ravel()
    CM = CM_Train[0].ravel()
    for i in range(1, len(Reports_Train)):
        ALL = np.vstack((ALL, Reports_Train[i].ravel()))
        CM = np.vstack((CM, CM_Train[i].ravel()))
    Mean = pd.DataFrame(ALL.mean(axis = 0).reshape(R.shape), index = R.index, columns = R.columns)
    STD = pd.DataFrame(ALL.std(axis = 0).reshape(R.shape), index = R.index, columns = R.columns)
    Reports_Train = Mean.applymap(lambda x: ('%.4f' % x))+ ' ± ' +STD.applymap(lambda x: ('%.4f' % x))
    CM_Train = CM.mean(axis = 0).reshape(CM_Train[0].shape).round(0).astype(int)
    del ALL, Mean, STD
    # Test: mean and standard deviation over the splits
    ALL = Reports_Test[0].ravel()
    CM = CM_Test[0].ravel()
    for i in range(1, len(Reports_Test)):
        ALL = np.vstack((ALL, Reports_Test[i].ravel()))
        CM = np.vstack((CM, CM_Test[i].ravel()))
    Mean = pd.DataFrame(ALL.mean(axis = 0).reshape(R.shape), index = R.index, columns = R.columns)
    STD = pd.DataFrame(ALL.std(axis = 0).reshape(R.shape), index = R.index, columns = R.columns)
    Reports_Test = Mean.applymap(lambda x: ('%.4f' % x))+ ' ± ' +STD.applymap(lambda x: ('%.4f' % x))
    CM_Test = CM.mean(axis = 0).reshape(CM_Test[0].shape).round(0).astype(int)
    del ALL, Mean, STD
    Reports_Train = Reports_Train.reset_index().rename(columns ={'index': 'Train Set (CV = % i)' % n_splits})
    Reports_Test = Reports_Test.reset_index().rename(columns ={'index': 'Test Set (CV = % i)' % n_splits})
    return Reports_Train, Reports_Test, CM_Train, CM_Test
def Confusion_Mat(CM_Train, CM_Test, PD, n_splits = 10):
    if n_splits is None:
        Titles = ['Train Set', 'Test Set']
    else:
        Titles = ['Train Set (CV = % i)' % n_splits, 'Test Set (CV = % i)' % n_splits]
    CM = [CM_Train, CM_Test]
    Cmap = ['Greens', 'YlGn','Blues', 'PuBu']
    for i in range(2):
        fig, ax = plt.subplots(1, 2, figsize= PD['FS'])
        fig.suptitle(Titles[i], weight = 'bold', fontsize = 16)
        # Raw counts
        _ = sns.heatmap(CM[i], annot=True, annot_kws={"size": PD['annot_kws']}, cmap=Cmap[2*i], ax = ax[0],
                        linewidths = 0.2, cbar_kws={"shrink": PD['shrink']})
        _ = ax[0].set_title('Confusion Matrix')
        # Counts normalized by the number of samples in each true class (row)
        Temp = np.round(CM[i].astype('float') / CM[i].sum(axis=1)[:, np.newaxis], 2)
        _ = sns.heatmap(Temp,
                        annot=True, annot_kws={"size": PD['annot_kws']}, cmap=Cmap[2*i+1], ax = ax[1],
                        linewidths = 0.4, vmin=0, vmax=1, cbar_kws={"shrink": PD['shrink']})
        _ = ax[1].set_title('Normalized Confusion Matrix')
        for a in ax:
            _ = a.set_xlabel('Predicted labels')
            _ = a.set_ylabel('True labels')
            _ = a.xaxis.set_ticklabels(PD['Labels'])
            _ = a.yaxis.set_ticklabels(PD['Labels'])
            _ = a.set_aspect(1)
def Train_Test_Scores(CM_Train, CM_Test):
    CM = [CM_Train, CM_Test]
    Sets = ['Train', 'Test']
    Colors = ['Green', 'Blue']
    for i in range(2):
        Header('%s Set' % Sets[i], C = Colors[i])
        # scikit-learn orders a binary confusion matrix as [[tn, fp], [fn, tp]];
        # the positive class is label 1, which is Benign in this dataset
        tn, fp, fn, tp = CM[i].ravel()
        Precision = tp/(tp+fp)
        Recall = tp/(tp + fn)
        TPR = tp/(tp +fn)
        TNR = tn/(tn +fp)
        BA = (TPR + TNR)/2
        print('Precision (%s) = %.2f' % (Sets[i], Precision))
        print('Recall (%s) = %.2f' % (Sets[i], Recall))
        print('TPR (%s) = %.2f' % (Sets[i], TPR))
        print('TNR (%s) = %.2f' % (Sets[i], TNR))
        print('Balanced Accuracy (%s) = %.2f' % (Sets[i], BA))
    Line()
Some of the metrics that we use here to measure performance are derived from the confusion matrix,
\begin{align} \text{Confusion Matrix} = \begin{bmatrix}T_p & F_p\\ F_n & T_n\end{bmatrix}, \end{align}
where $T_p$, $T_n$, $F_p$, and $F_n$ represent true positive, true negative, false positive, and false negative, respectively. Note that scikit-learn's confusion_matrix arranges the same four counts of a binary problem as $\begin{bmatrix}T_n & F_p\\ F_n & T_p\end{bmatrix}$, which is the order unpacked in the code.
\begin{align} \text{Precision} &= \frac{T_{p}}{T_{p} + F_{p}},\\ \text{Recall} &= \frac{T_{p}}{T_{p} + F_{n}},\\ \text{F1} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},\\ \text{Balanced-Accuracy (bACC)} &= \frac{1}{2}\left( \frac{T_{p}}{T_{p} + F_{n}} + \frac{T_{n}}{T_{n} + F_{p}}\right ). \end{align}
Accuracy can be a misleading metric for imbalanced data sets. In such cases, balanced accuracy (bACC) [4] is recommended: it normalizes true positive and true negative predictions by the number of positive and negative samples, respectively, and divides their sum by two.
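As a worked instance of the formulas above, the following sketch evaluates them for one set of confusion-matrix counts. The counts are hypothetical and chosen only for illustration, and the function name binary_metrics is an assumption rather than part of the notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 and balanced accuracy from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # true positive rate
    tnr = tn / (tn + fp)      # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    bacc = (recall + tnr) / 2
    return {"precision": precision, "recall": recall, "f1": f1, "bacc": bacc}

# Hypothetical counts, chosen only for illustration
scores = binary_metrics(tn=61, fp=3, fn=2, tp=105)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because balanced accuracy averages the true positive rate and the true negative rate, a model that ignores the minority class is penalized even when its plain accuracy looks high.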
Name = 'Support Vector Machine'
Header('%s with Default Parameters' % Name)
n_splits = 20
SVM = SVC()
print('Default Parameters = %s' % SVM.get_params(deep=True))
_ = SVM.fit(X_train, y_train)
##
Reports_Train, Reports_Test, CM_Train, CM_Test = Stratified_CV_Scoring(SVM, X = X, y = y, n_splits = n_splits)
display(Reports_Train.style.hide_index().set_properties(**{'background-color': 'HoneyDew', 'color': 'Black'}).\
set_properties(subset=['Train Set (CV = % i)' % n_splits], **{'background-color': 'SeaGreen', 'color': 'White'}))
display(Reports_Test.style.hide_index().set_properties(**{'background-color': 'Azure', 'color': 'Black'}).\
set_properties(subset=['Test Set (CV = % i)' % n_splits], **{'background-color': 'RoyalBlue', 'color': 'White'}))
PD = dict(FS = (10, 5), annot_kws = 14, shrink = .6, Labels = list(Labels_dict.values()))
Confusion_Mat(CM_Train, CM_Test, PD = PD, n_splits = n_splits)
Train_Test_Scores(CM_Train, CM_Test)
Support Vector Machine with Default Parameters =====================================================
Default Parameters = {'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
| Train Set (CV = 20) | precision | recall | f1-score | support |
|---|---|---|---|---|
| Malignant | 0.9958 ± 0.0046 | 0.9645 ± 0.0067 | 0.9799 ± 0.0047 | 148.0000 ± 0.0000 |
| Benign | 0.9794 ± 0.0038 | 0.9976 ± 0.0027 | 0.9884 ± 0.0027 | 250.0000 ± 0.0000 |
| accuracy | 0.9853 ± 0.0034 | 0.9853 ± 0.0034 | 0.9853 ± 0.0034 | 0.9853 ± 0.0034 |
| macro avg | 0.9876 ± 0.0034 | 0.9811 ± 0.0040 | 0.9842 ± 0.0037 | 398.0000 ± 0.0000 |
| weighted avg | 0.9855 ± 0.0034 | 0.9853 ± 0.0034 | 0.9853 ± 0.0034 | 398.0000 ± 0.0000 |
| Test Set (CV = 20) | precision | recall | f1-score | support |
|---|---|---|---|---|
| Malignant | 0.9740 ± 0.0154 | 0.9563 ± 0.0240 | 0.9648 ± 0.0141 | 64.0000 ± 0.0000 |
| Benign | 0.9743 ± 0.0136 | 0.9846 ± 0.0095 | 0.9793 ± 0.0080 | 107.0000 ± 0.0000 |
| accuracy | 0.9740 ± 0.0102 | 0.9740 ± 0.0102 | 0.9740 ± 0.0102 | 0.9740 ± 0.0102 |
| macro avg | 0.9742 ± 0.0100 | 0.9704 ± 0.0124 | 0.9721 ± 0.0111 | 171.0000 ± 0.0000 |
| weighted avg | 0.9742 ± 0.0100 | 0.9740 ± 0.0102 | 0.9739 ± 0.0103 | 171.0000 ± 0.0000 |
Train Set ==========================================================================================
Precision (Train) = 0.98
Recall (Train) = 1.00
TPR (Train) = 1.00
TNR (Train) = 0.97
Balanced Accuracy (Train) = 0.98
Test Set ===========================================================================================
Precision (Test) = 0.97
Recall (Test) = 0.98
TPR (Test) = 0.98
TNR (Test) = 0.95
Balanced Accuracy (Test) = 0.97
====================================================================================================
To find good hyperparameters for our model, we can use RandomizedSearchCV. Here, we have defined a function, Best_Parm, that runs the search and reports the best-scoring parameter combinations.
SVM = SVC()
param_dist = dict(C = [1e3], kernel = ['poly', 'rbf', 'sigmoid'], class_weight= [None, 'balanced'],
gamma = ['scale', 'auto', 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1] )
Header('%s with the Best Parameters' % Name)
grid = Best_Parm(model = SVM, param_dist = param_dist, Top = 20, H = 800)
Support Vector Machine with the Best Parameters ====================================================
| Rank Test Score | Params | Train Score | Test Score | Fit Time |
|---|---|---|---|---|
| 1 | kernel: sigmoid, gamma: 0.0001, class_weight: balanced, C: 1000.0 | 9.85e-01 ± 4.09e-03 | 9.80e-01 ± 1.47e-02 | 5.15e-03 ± 4.77e-04 |
| 2 | kernel: rbf, gamma: 0.0001, class_weight: balanced, C: 1000.0 | 9.86e-01 ± 3.41e-03 | 9.79e-01 ± 1.34e-02 | 5.30e-03 ± 9.54e-04 |
| 3 | kernel: rbf, gamma: 0.0005, class_weight: balanced, C: 1000.0 | 9.88e-01 ± 3.92e-03 | 9.78e-01 ± 1.48e-02 | 5.35e-03 ± 5.72e-04 |
| 4 | kernel: sigmoid, gamma: 0.001, class_weight: balanced, C: 1000.0 | 9.87e-01 ± 4.38e-03 | 9.78e-01 ± 1.69e-02 | 5.85e-03 ± 6.54e-04 |
| 5 | kernel: sigmoid, gamma: 0.0005, class_weight: balanced, C: 1000.0 | 9.87e-01 ± 3.97e-03 | 9.77e-01 ± 1.42e-02 | 5.10e-03 ± 1.09e-03 |
| 6 | kernel: rbf, gamma: 0.1, class_weight: None, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.75e-01 ± 1.24e-02 | 8.50e-03 ± 9.22e-04 |
| 6 | kernel: rbf, gamma: 0.1, class_weight: balanced, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.75e-01 ± 1.24e-02 | 8.30e-03 ± 6.41e-04 |
| 8 | kernel: rbf, gamma: scale, class_weight: None, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.75e-01 ± 1.80e-02 | 5.90e-03 ± 8.89e-04 |
| 8 | kernel: rbf, gamma: scale, class_weight: balanced, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.75e-01 ± 1.80e-02 | 6.00e-03 ± 8.37e-04 |
| 10 | kernel: rbf, gamma: auto, class_weight: balanced, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.75e-01 ± 1.76e-02 | 5.75e-03 ± 5.36e-04 |
| 10 | kernel: rbf, gamma: auto, class_weight: None, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.75e-01 ± 1.76e-02 | 5.85e-03 ± 7.27e-04 |
| 12 | kernel: rbf, gamma: 0.005, class_weight: balanced, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.74e-01 ± 1.66e-02 | 5.35e-03 ± 4.77e-04 |
| 13 | kernel: sigmoid, gamma: 0.001, class_weight: None, C: 1000.0 | 9.85e-01 ± 4.23e-03 | 9.74e-01 ± 1.80e-02 | 4.65e-03 ± 4.77e-04 |
| 14 | kernel: rbf, gamma: 0.001, class_weight: None, C: 1000.0 | 9.88e-01 ± 4.81e-03 | 9.73e-01 ± 1.59e-02 | 5.25e-03 ± 6.99e-04 |
| 15 | kernel: rbf, gamma: 0.001, class_weight: balanced, C: 1000.0 | 9.90e-01 ± 4.22e-03 | 9.73e-01 ± 1.43e-02 | 5.40e-03 ± 7.35e-04 |
| 16 | kernel: rbf, gamma: 0.0005, class_weight: None, C: 1000.0 | 9.85e-01 ± 4.44e-03 | 9.73e-01 ± 1.72e-02 | 5.30e-03 ± 7.81e-04 |
| 17 | kernel: rbf, gamma: 0.005, class_weight: None, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.73e-01 ± 1.58e-02 | 5.40e-03 ± 1.20e-03 |
| 18 | kernel: rbf, gamma: 0.01, class_weight: None, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.73e-01 ± 2.02e-02 | 4.85e-03 ± 5.72e-04 |
| 18 | kernel: rbf, gamma: 0.01, class_weight: balanced, C: 1000.0 | 1.00e+00 ± 0.00e+00 | 9.73e-01 ± 2.02e-02 | 5.35e-03 ± 5.72e-04 |
| 20 | kernel: rbf, gamma: 0.0001, class_weight: None, C: 1000.0 | 9.81e-01 ± 4.43e-03 | 9.72e-01 ± 1.80e-02 | 4.85e-03 ± 4.77e-04 |
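Since every entry of param_dist is a finite list, the search space is a finite grid, and its size can be compared with the requested n_iter. The sketch below enumerates the combinations with itertools.product. To the best of our knowledge, RandomizedSearchCV samples such list-only spaces without replacement and, when n_iter exceeds the number of combinations, warns and evaluates each combination once; the variable n_evaluated below encodes that assumption.

```python
from itertools import product

# Same candidate values as param_dist in the search above
param_dist = dict(C=[1e3], kernel=['poly', 'rbf', 'sigmoid'], class_weight=[None, 'balanced'],
                  gamma=['scale', 'auto', 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1])

# Every distinct combination of the listed values
combinations = [dict(zip(param_dist, values)) for values in product(*param_dist.values())]

n_iter = int(1e3)
# Assumption: the search cannot evaluate more distinct candidates than the grid contains
n_evaluated = min(n_iter, len(combinations))
print(len(combinations), n_evaluated)
```

Under that assumption, the search here behaves like an exhaustive grid search, so GridSearchCV would be an equivalent and more explicit choice for this parameter set.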
Having identified the best parameters, we now train another model using them.
Header('%s with the Best Parameters' % Name)
SVM = SVC(**grid.best_params_)
print('Default Parameters = %s' % SVM.get_params(deep=True))
_ = SVM.fit(X_train, y_train)
Reports_Train, Reports_Test, CM_Train, CM_Test = Stratified_CV_Scoring(SVM, X = X, y = y, n_splits = n_splits)
display(Reports_Train.style.hide_index().set_properties(**{'background-color': 'HoneyDew', 'color': 'Black'}).\
set_properties(subset=['Train Set (CV = % i)' % n_splits], **{'background-color': 'DarkGreen', 'color': 'White'}))
display(Reports_Test.style.hide_index().set_properties(**{'background-color': 'Azure', 'color': 'Black'}).\
set_properties(subset=['Test Set (CV = % i)' % n_splits], **{'background-color': 'MediumBlue', 'color': 'White'}))
Confusion_Mat(CM_Train, CM_Test, PD = PD, n_splits = n_splits)
Train_Test_Scores(CM_Train, CM_Test)
Support Vector Machine with the Best Parameters ====================================================
Default Parameters = {'C': 1000.0, 'break_ties': False, 'cache_size': 200, 'class_weight': 'balanced', 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 0.0001, 'kernel': 'sigmoid', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
| Train Set (CV = 20) | precision | recall | f1-score | support |
|---|---|---|---|---|
| Malignant | 0.9854 ± 0.0065 | 0.9747 ± 0.0070 | 0.9800 ± 0.0050 | 148.0000 ± 0.0000 |
| Benign | 0.9851 ± 0.0041 | 0.9914 ± 0.0039 | 0.9882 ± 0.0030 | 250.0000 ± 0.0000 |
| accuracy | 0.9852 ± 0.0037 | 0.9852 ± 0.0037 | 0.9852 ± 0.0037 | 0.9852 ± 0.0037 |
| macro avg | 0.9852 ± 0.0040 | 0.9830 ± 0.0042 | 0.9841 ± 0.0040 | 398.0000 ± 0.0000 |
| weighted avg | 0.9852 ± 0.0037 | 0.9852 ± 0.0037 | 0.9852 ± 0.0037 | 398.0000 ± 0.0000 |
| Test Set (CV = 20) | precision | recall | f1-score | support |
|---|---|---|---|---|
| Malignant | 0.9709 ± 0.0220 | 0.9656 ± 0.0260 | 0.9678 ± 0.0147 | 64.0000 ± 0.0000 |
| Benign | 0.9798 ± 0.0147 | 0.9822 ± 0.0138 | 0.9809 ± 0.0087 | 107.0000 ± 0.0000 |
| accuracy | 0.9760 ± 0.0109 | 0.9760 ± 0.0109 | 0.9760 ± 0.0109 | 0.9760 ± 0.0109 |
| macro avg | 0.9753 ± 0.0115 | 0.9739 ± 0.0128 | 0.9744 ± 0.0117 | 171.0000 ± 0.0000 |
| weighted avg | 0.9764 ± 0.0106 | 0.9760 ± 0.0109 | 0.9760 ± 0.0109 | 171.0000 ± 0.0000 |
Train Set ==========================================================================================
Precision (Train) = 0.98
Recall (Train) = 0.99
TPR (Train) = 0.99
TNR (Train) = 0.97
Balanced Accuracy (Train) = 0.98
Test Set ===========================================================================================
Precision (Test) = 0.98
Recall (Test) = 0.98
TPR (Test) = 0.98
TNR (Test) = 0.97
Balanced Accuracy (Test) = 0.98
====================================================================================================